Cox proportional hazards regression in small studies of predictive biomarkers

Predictive biomarkers are essential for personalized medicine since they select the best treatment for a specific patient. However, of all biomarkers that are evaluated, only few are eventually used in clinical practice. Many promising biomarkers may be erroneously abandoned because they are investigated in small studies using standard statistical techniques which can cause small sample bias or lack of power. The standard technique for failure time endpoints is Cox proportional hazards regression with a multiplicative interaction term between binary variables of biomarker and treatment. Properties of this model in small studies have not been evaluated so far, therefore we performed a simulation study to understand its small sample behavior. As a remedy, we applied a Firth correction to the score function of the Cox model and obtained confidence intervals (CI) using a profile likelihood (PL) approach. These methods are generally recommended for small studies of different design. Our results show that a Cox model estimates the biomarker-treatment interaction term and the treatment effect in one of the biomarker subgroups with bias, and overestimates their standard errors. Bias is however reduced and power is increased with Firth correction and PL CIs. Hence, the modified Cox model and PL CI should be used instead of a standard Cox model with Wald based CI in small studies of predictive biomarkers.

In past decades, much research focused on identifying characteristics of tumors and patients to optimize anticancer therapy in individual patients 1 .To improve tumor response, information like germline and tumor genetic variability, tumor (immune) environment, and lifestyle and comorbidities of patients diagnosed with cancer can be taken into account 2 .A characteristic that identifies patients who require additional systemic therapy besides local therapy (surgery, radiotherapy), i.e., indicates who needs additional therapy, is a prognostic biomarker.A characteristic that selects the most promising treatment for a specific patient, i.e., indicates how one should be treated, is a predictive biomarker 3 .Thus, predictive biomarkers are essential for personalized medicine.
Of the many evaluated biomarkers, only few reach clinical practice because of many challenges during the translational phase 4,5 .One possible concern might be the use of suboptimal statistical methods for the available biomarker data.
To identify a binary predictive biomarker, application of a statistical interaction test between the biomarker and the treatment is recommended to evaluate whether a relative benefit from a specific experimental treatment compared with a control treatment differs by biomarker level 6,7 .However, to guide a treatment choice, a qualitative rather than a quantitative interaction is needed.A qualitative interaction is present when an experimental treatment is not superior to a control treatment (i.e., is equally efficacious or worse) at one biomarker level but is superior at the other biomarker level.A quantitative interaction is present when an experimental treatment is superior to a control treatment in both biomarker levels but the magnitude of the treatment benefit differs in these subgroups 8 .
A commonly used statistical method for failure time data is the Cox proportional hazards model with a multiplicative interaction term between indicator variables of biomarker and treatment in a cohort of suitable patients 9 .Unfortunately, to obtain unbiased estimates and to detect statistically significant interactions of moderate size with sufficient statistical power, a large number of patients is required 8 , which is often not available in biomarker studies.Even if available, limited research budgets may prohibit the often costly measurement of the biomarker status in a large cohort.Consequently, too small patient series are interrogated and small sample bias as well as lack of power can lead to inconclusive results and thus perhaps to abandoning a promising biomarker.
In order to understand the small sample behavior of the Cox model 9 for interaction analyses, we performed a simulation study in settings similar to the results of existing clinical studies on breast cancer (BC) [10][11][12][13][14] .We focused on properties of the biomarker-treatment interaction estimate.Additionally, we evaluated estimates of the treatment effects by biomarker level.Results of a standard Cox model were compared with a Firth-corrected Cox model, i.e., a Cox model with a modified score function 15 .Profile likelihood (PL) and Wald confidence intervals (CI) were also compared.To our knowledge, the Firth correction and PL CIs have not yet been investigated for a Cox model with an interaction term.However, they were evaluated to overcome the asymptotic bias for the estimation of prognostic effects of covariates.Heinze and Schemper 16 and Heinze 17 demonstrated that in small samples the Firth-corrected Cox model was superior over a standard Cox model, especially in scenarios with heavy censoring and strong covariate effects on survival.They also showed that inference should be based on PL CIs rather than on Wald-type CIs.Therefore, the aim of our study was to replicate the results of Heinze and Schemper 16 and Heinze 17 in settings specific to studies on predictive biomarkers.We evaluated and compared results of studies with protective, null and harmful biomarker effects and biomarkers of varying prevalence, as well as studies with varying strengths and directions of the association between the biomarker and the treatment.Our focus was to find scenarios that could indicate when data on predictive biomarker have to be analyzed with modifications of the standard Cox model.

Data generation
The simulation study design followed the recommendations of Morris et al. 18 and methods described by Bender et al. 19 .N datasets with n patients each were generated.All patients were assigned to one of four combinations of biological marker M (low level: M = 0 ; high level: M = 1 ) and treatment T (standard treatment: T = 0 ; experi- mental treatment: T = 1 ) according to a multinomial distribution with probabilities p 00 (low marker level and standard treatment), p 10 (high marker level and standard treatment), p 01 (low marker level and experimental treatment), p 11 (high marker level and experimental treatment).The probabilities were calculated as functions of the proportion p T of patients treated with the experimental treatment, the proportion p M of patients with high marker level, and the odds ratio OR MT of the association between marker and treatment 20 .
Event times t e were generated as where U e was a random uniform variable on the interval [0, 1], MT was the product of M and T, exp (β M ) = HR M was the marker hazard ratio (HR) for high vs. low marker level among patients receiving standard treatment, exp (β T ) = HR T was the treatment HR for experimental vs. standard treatment among patients with low marker level, and exp (β I ) = HR I was the interaction HR between treatment effects in high vs. low marker level.Param- eter e was a scale parameter of an exponential distribution with survival function S(t) = exp (− e t) used to calculate baseline survival, i.e., survival of patients with low marker level receiving standard treatment.The parameter was defined as so that the baseline proportion of events before the end of follow-up t end was p e , i.e., S(t end ) = 1 − p e .Censoring time t c was generated similarly with U c as a random uniform variable on the interval [0, 1], scale parameter c , the proportion p c of patients with low marker level receiving standard treatment censored before t end (exclud- ing administrative censoring at the end of the study period) and β M = β T = β I = 0 to achieve non-differential censoring by marker and treatment.If t e ≤ min (t c , t end ), the patient was specified as experiencing an event at t e .Otherwise, the patient was censored at min (t c , t end ).Datasets with different values for n, p M , p c , OR MT , HR M , HR I and N = 10000 , p T = 0.5 , p e = 0.2 , t end = 5 years, HR T = 1 were generated (Table 1), and the different values, except for HR T , were chosen based on real datasets presented in "Data from previous breast cancer studies".In order to generate data with a qualitative interaction between the marker and the treatment, scenarios with equally efficacious treatments among patients with low marker levels were generated (HR T = 1 ).In addition, we evaluated scenarios with a selection of quan- titative interactions as observed in real datasets, i.e., beneficial treatment effects in both marker levels but with different relative magnitude.We do not show these results but refer to them in the discussion.

Data from previous breast cancer studies
We based our simulation scenarios on information from five studies of different BC subtypes, namely earlystage triple negative 10 , early 11 , premenopausal stage II 12 , high-risk 13 , and stage III negative human epidermal growth factor receptor 2 14 .The endpoints considered were either time to BC relapse or death due to any cause [recurrence-free survival (RFS), disease-free survival (DFS)] or time to death due to BC [breast cancer-specific survival (BCSS)].The information we extracted from the datasets (Table 2) differed slightly from published results because we did not adjust analyses for prognostic or confounding variables.We also censored patients at 5 years to be able to compare values of the different parameters in our simulation scenarios across studies.
Three of the studies were randomized controlled trials and two were observational series of patients.In none of the studies was the evaluation of the predictive marker effect a primary objective.Marker measurements were obtained using archived specimens and were not available for all patients in the original study.Three studies evaluated the BRCA1-like marker, and two studies compared high-dose chemotherapy to conventional chemotherapy.The sample size varied from 117 to 541.The proportion of patients with high marker levels was about 50% in three studies, but only 14% and 18% in the other two studies.The experimental treatment was given to 42-58% of the patients, and the OR MT between marker and treatment was between 0.79 and 2.34.The marker was protective among patients treated with the standard treatment in two studies ( HR M = 0.67 , 0.86), but harmful in the other three studies ( HR M = 3.51 , 5.39, 6.60).In all studies, patients with the low marker level benefitted from the experimental treatment ( HR T between 0.23 and 0.87) and in only one study was the benefit  p-value for marker-treatment interaction term, pT pathological tumor stage, RCT randomized controlled trial, RFS recurrence-free survival, TAM tamoxifen, T + CEF chemotherapy with docetaxel followed by cyclophosphamide-epirubicin-fluorouracil, TN triple-negative, TX-CEX chemotherapy with capecitabine-docetaxel followed by cyclophosphamide-epirubicin-fluorouracil. a Pooled study of 6 observational studies.

Data analysis
The generated datasets were analyzed using a standard Cox proportional hazards model 9 with hazard function where h 0 was the baseline hazard function.Note that, if β I = 0 , the joint effect of marker and treatment is mul- tiplicative, i.e., the HR of a patient with a high marker level and experimental treatment vs. a patient with a low marker level and standard treatment is the product of exp (β M ) and exp (β T ) .This situation is often referred to as absence of interaction.If β I = 0 , the joint effect is super-( β I > 0 ) or submultiplicative ( β I < 0 ), and this is usually referred to as interaction.Additionally, we used the following parametrization of model (1)   for the evaluation of the treatment effect by marker level.and exp β TM high = HR TM high were the HRs for experimental vs. standard treatment in subgroups of low and high marker levels, respectively.All datasets were also analyzed by a bias-eliminating approach originally developed by Firth 15 for generalized linear models and later implemented in Cox regression 16 .In contrast to the standard Cox model, a Firth-corrected Cox model provides finite HR estimates for monotone likelihoods, i.e., when the likelihood function does not have a unique maximum and the parameter estimate of a Cox model diverges with infinite standard error.For example, in our model with the interaction term, the problem can occur when there is no patient with an event in at least one of the four marker-treatment combination subgroups.Additionally, with monotone likelihoods, maximum likelihood estimate is infinite and the likelihood function becomes highly asymmetric leading to unsuitable CI obtained based on the Wald method which assumes a normal distribution of the maximum likelihood estimate.As an alternative, the PL method for CI construction is based on the asymptotic χ 2 distribution of the log likelihood ratio test statistic 21 .Therefore, we calculated and compared results obtained with 95% CI according to Wald and PL methods.All scenarios were summarized by calculating (i) bias 1 relative % error in model standard error (ModSE) 100 ModSE EmpSE − 1 , which was the ratio between ModSE j=1 Var βI,j and empirical standard error (EmpSE) j=1 1 βl,j ≤ β I ≤ βu,j , where βl,j was the lower bound and βu,j was the upper bound of the 95% CI around βI,j , and (iv) type I error or power 1 N c N c j=1 1(p j ≤ α) , where p j was the p-value for the test β I = 0 obtained with the j-th dataset and α = 0.05 was the significance level.In all formulas, N c was the number of converged models, β I was the true value of the coefficient of the interaction term, βI,j was the estimate of the interaction coefficient in the j-th dataset, βI was the mean of all βI,j and 1 was an indicator function.Standard errors were based on the Hessian matrix.Coverage, type I error and power were calculated using both Wald and PL methods since these measures depend on the method of CI calculation.The bias and relative percentage error in standard error were additionally calculated for the treatment effect in the subgroups of low and high marker levels separately.All performance measures were defined as in 18 .
Simulation scripts were written in R version 4.3.1 using the coxphf function of the coxphf package version 1.13.4 22 and are available on request from the first author.The maximum number of iterations (maxiter) was set to 1000 and the maximum step size (maxstep) was 0.01.If the actual number of iterations for a model fit was less than the prespecified maximum number of iterations, a model was considered converged and the estimation of the interaction term, treatment effects and their standard errors were used for summary statistics.If convergence was not reached, the dataset was discarded from summary statistics.Moreover, if a generated dataset had no events in more than one combination of marker and treatment, the dataset was not used in summary statistics irrespective of convergence.PL-based power and coverage were calculated from datasets with overall model convergence and with convergence of the confidence bound required to determine whether or not the PL CI included zero or the true parameter value.However, in additional analyses, coverage and power for both CI methods were calculated including results from non-converged models as not rejecting the null hypothesis and covering the true value.

Cox model
Relative bias of the interaction effect estimate was towards and away from the null, and its magnitude depended mostly on the number of patients per dataset and the number of events in the different marker-treatment combinations, the marker effect among patients treated with the standard treatment and the proportion of patients with high marker level.Bias was usually between − 10% and 10% and monotonically approached zero for sample   Results of the simulation study for a protective ( HR M = 0.6 , left panel) and a harmful ( HR M = 3 , right panel) marker effect among patients treated with the standard treatment.The treatment HRs were HR TM low = 1 and HR TM high = 0.25 , the interaction HR was HR I = 0.25 , the OR between marker and treatment was OR MT = 1 , the proportion of patients with high marker level was p M = 0.25 , and the proportion of censored patients with low marker level receiving standard treatment was p c = 0.2 .* Curves of bias in marker low group for Cox and Firth model overlap.HR hazard ratio, OR odds ratio, PL profile likelihood, Rel.relative, SE standard error.
Table 3. Results of the simulation study for a protective ( HR M = 0.6 ), a null ( HR M = 1 ) and a harmful ( HR M = 3 ) marker effect among patients treated with the standard treatment.The treatment HRs were HR TM low = 1 and HR TM high = 0.25 , the interaction HR was HR I = 0.25 , the OR between marker and treatment was OR MT = 1 , the proportion of patients with high marker level was p M = 0.25 , and the proportion of censored patients with low marker level receiving standard treatment was p c = 0.   was HR I = 0.75 , the marker effect among patients treated with the standard treatment was HR M = 0.8 , the OR between marker and treatment was OR MT = 0.5 , and the proportion of censored patients with low marker level receiving standard treatment was p c = 0.2 .HR hazard ratio, OR odds ratio, PL profile likelihood, Rel.relative, SE standard error.
Table 4. Results of the simulation study for 25% ( p M = 0.25 ), 50% ( p M = 0.5 ) and 75% ( p M = 0.75 ) of patients with high marker level.The treatment HRs were HR TM low = 1 and HR TM high = 0.75 , the interaction HR was HR I = 0.75 , the marker effect among patients treated with the standard treatment was HR M = 0.8 , the OR between marker and treatment was OR MT = 0.5 , and the proportion of censored patients with low marker level receiving standard treatment was p c = 0.2. a Bias for β TM low , β TM high and relative bias (%) for SE( β TM low ), SE( β TM high ), β I , SE( β I ).Other parameters: HR TM low = 1 , HR TM high = 0.75 , HR I = 0.75 , HR M = 0.8 , OR MT = 0.5 , p c = 0.2.HR hazard ratio, n number of patients per dataset, N c number of converged models, OR odds ratio, PL profile likelihood, SE standard error.HR TM high = 0.5 , the interaction HR was HR I = 0.5 , the marker effect among patients treated with the standard treatment was HR M = 1 , and the proportion of censored patients with low marker level receiving standard treatment was p c = 0.2 .HR hazard ratio, PL profile likelihood, Rel.relative, SE standard error.
Table 5. Results of the simulation study when fewer ( OR MT = 0.5 ), the same ( OR MT = 1 ) and more ( OR MT = 2 ) patients with high marker level received experimental in comparison to standard treatment and the proportion of patients with high marker level was p M = 0.5 .The treatment HRs were HR TM low = 1 and HR TM high = 0.5 , the interaction HR was HR I = 0.5 , the marker effect among patients treated with the standard treatment was HR M = 1 , and the proportion of censored patients with low marker level receiving standard treatment was p c = 0.2. a Bias for β TM low , β TM high and relative bias (%) for SE( β TM low ), SE( β TM high ), β I , SE( β I ) Other parameters: HR TM low = 1 , HR TM high = 0.5 , HR I = 0.5 , HR M = 1 , p M = 0.5 , p c = 0.2 HR hazard ratio, n number of patients per dataset, N c number of converged models, PL profile likelihood, SE standard error.3, 4, 5), while for selected scenarios and smaller numbers of patients, it was high and ranged up to 72% when the proportion of patients with low marker level who received standard treatment and were censored before the end of follow-up, p c , was 20% (Supplementary Table S1).Bias was also within [− 10%, 10%] when the marker had a harmful effect on survival among patients treated with the standard treatment, i.e., HR M > 1 , but more severe bias occurred when the marker had a protective or no effect on survival in this subgroup of patients, i.e., HR M ≤ 1 , and the smaller the HR M the larger the bias (Fig. 1, Table 3).What is more, higher bias was often observed for lower proportions of patients with high marker level (Fig. 2, Table 4).Nonetheless, in scenarios with HR I = 1 , i.e., with no interaction effect, bias of the interaction coefficient was always within the interval [− 0.1, 0.1] for all parameters and sample sizes (Supplementary Table S2).Thus, when the marker had a harmful effect on survival among patients treated with the standard treatment, or a high proportion of patients had a high marker level, or there was no interaction effect, bias of the interaction effect estimate was usually acceptable.
Relative percentage error of the estimated standard error of the interaction coefficient was predominantly positive and in general its magnitude behaved similarly as the relative bias of the interaction coefficient.Scenarios with high relative bias of the interaction coefficient generally showed also high bias of its standard error (Figs. 1, 2, 3, Tables 3, 4, 5).
The most extreme bias of the estimated interaction effect (up to 72% with p c = 0.2 ) and its standard error (up to 48% with p c = 0.2 ) was observed in scenarios with a protective marker of low prevalence, strong negative interaction, and a higher or lower prevalence of marker-positive patients in the standard vs. experimental treatment group.Both biases were positive resulting in values of the interaction effect being biased towards the null and overestimated standard error, i.e., leading to smaller or no differences in the benefit from an experimental treatment compared with a control treatment by marker level and wider CIs for the comparison of the benefit.For such a combination of parameters the incidence of the event was very low in the subgroup of patients with a high marker level and experimental treatment.Subsequently, many small datasets generated under such scenarios did not have events in this subgroup so that models failed to converge (Supplementary Table S1).In general, the smaller the values of HR M and p M , and OR MT being away from 1, the larger was the number of non-converged models which ranged up to 59% (data not shown), particularly for small sample sizes.
A small number of events in at least one marker-treatment combination and positive bias of the standard error of the interaction effect estimate led to high and overestimated standard error.This resulted in very wide confidence intervals and consequently overcoverage of Wald-type 95% CIs, while coverage obtained with the PL method approached the nominal level even for smaller sample sizes (Figs. 1, 2, 3, Tables 3, 4, 5, Supplementary Table S1).Under the null, overcoverage of CIs co-occurred with a type I error below the nominal 5% level, while for scenarios with coverage close to 95%, the type I error was also close to 5% (Supplementary Table S2).
As expected, power increased with increasing sample size, higher values of the HR M and stronger interaction effect.For a given sample size, it was usually highest for markers with 50% prevalence and lowest for markers with low prevalence (25%) when HR M ≤ 1 and for markers with high prevalence (75%) when HR M > 1 .Power was relatively independent of the association between marker and treatment, and usually larger when based on the PL vs. Wald CI.In general, a large number of patients, a strong interaction effect or a strongly harmful marker (large HR M ) were needed to reach 80% power (Tables 3, 4, 5, Supplementary Table S1).
The treatment effect for the low marker level and its standard error were estimated without bias in all scenarios.However, the treatment effect and its standard error for the high marker level were estimated without or with minor bias only in scenarios with negligible bias of the interaction effect and its standard error.In other scenarios, the bias of the interaction effect and its standard error caused corresponding and similarly behaving bias of the estimates of the treatment effect and its standard error for the high marker level.Since the bias was positive or negative, both false nagative and false positive results were possible (Figs. 1, 2, 3, Tables 3, 4, 5, Supplementary Tables S1-S2).Note that we presented bias of the treatment effect for the high marker level, but its relative bias was very close to the relative bias of the interaction effect because the treatment effect for the low marker level was unbiased.
Increasing the proportion of patients with low marker level who received standard treatment and were censored before the end of follow-up, from 20 to 50% , resulted in a slightly larger bias in all estimates, a lower power but also a larger number of non-converged models.It did not change, however, the general performance of the model (Fig. 1, Table 3, Supplementary Fig. S1, Supplementary Table S3).The number of patients per dataset, the marker effect among patients treated with the standard treatment and the proportion of patients with high marker level seemed to have a bigger impact on the performance measures of converging models.However, a large number of models fail to converge.

Cox model with Firth correction
Virtually unbiased interaction estimates were obtained when (i) the marker was harmful among patients receiving standard treatment, (ii) more patients with high marker levels received experimental vs. standard treatment, (iii) no interaction was present between the marker and the treatment, or (iv) sample size was larger than 400.Bias of the interaction effect estimate occurred when the marker was protective among patients with the standard treatment, marker prevalence was low, the proportion of patients with high marker level was equal or higher with standard vs. experimental treatment, and the sample size did not exceed 400.In these situations, bias was also observed for the standard error (Figs. 1, 2, 3, Tables 3, 4, 5, Supplementary Tables S1-S2).
Convergence of models was very high for all scenarios and coverage of PL CIs was mostly at nominal level.Overcoverage occurred when the interaction effect or the standard error estimate was biased.Coverage of Wald CIs was generally higher than nominal.Under the null, the scenarios with overcoverage suffered from subnominal type I error rates (Figs.1-3, Tables 3, 4, 5, Supplementary Tables S1-S2).www.nature.com/scientificreports/In all scenarios, the treatment effect and its standard error among patients with low marker levels was unbiased.Estimation of the treatment effect and its standard error in patients with high marker levels, which generally had fewer patients and a lower incidence than the low marker level, was biased when the interaction term and its standard error were biased (Figs. 1, 2, 3, Tables 3, 4, 5, Supplementary Tables S1-S2).

Comparison of standard and Firth corrected Cox model
When comparing Cox and Firth corrected Cox model, bias was absent for sample sizes exceeding 400 with the Firth correction vs. the standard Cox model which resulted in biased estimates for sample sizes up to 600.Moreover, coverage of PL CIs was more often at nominal level and power was usually larger when a Firth correction with PL CI was applied.Additionally, the modified Cox model converged more often resulting in estimation of treatment and interaction effects in scenarios which could not be analyzed with a standard Cox model due to an insufficient number of observed events.For particular scenarios, the number of converged models was particularly higher for the modified Cox model.The higher the censoring rate the higher the difference in the number of converged models between the two approaches (Figs. 1, 2,3, Tables 3, 4, 5, Supplementary Fig. S1, Supplementary Tables S1-S3).Thus, generally implementing the Firth correction improved estimation substantially.In just a few scenarios with a protective and highly prevalent marker, a weak interaction effect and a small sample size, the Firth correction did not improve estimation (Fig. 2, Table 4).
Differences between the Firth corrected and standard Cox model became even more apparent when performance measures were obtained including results from non-converged models.Power decreased and coverage increased for the standard approach and the larger the number of non-convergence the larger the change in the performance measures.The two measures, however, were rather stable for the Firth corrected Cox model (data not shown).
Firth corrected Cox models converged usually in more than 95% of the datasets, while for the standard Cox model this was substantially less and under 50% for some scenarios.The Firth corrected Cox model performed well when results from all converging models were evaluated.Limited to results from those data sets for which both methods converged, relative bias of the (negative) interaction term in a Firth corrected Cox model was larger than for the standard Cox model in scenarios where its bias was positive.In this case, the estimate of the standard Cox model was biased towards the null and the Firth corrected estimate was even more attenuated.The standard error was then larger and the power was smaller with the Firth correction.On the other hand, in scenarios where the standard Cox model estimated the interaction coefficient with a negative bias, i.e., an overestimate, the Firth corrected estimate was less overestimated (data not shown).

Discussion
Although the evaluation of treatment heterogeneity is a field of active research and many different approaches have been proposed [23][24][25][26] , the analysis of most clinical studies with failure time endpoints relies on informal marker-specific comparisons of survival curves by treatment or formal Cox regression with a multiplicative interaction term between marker and treatment 27 .The aim of our study was to understand the properties of the latter commonly used approach in the specific situation of BC and to identify easy-to-use modifications to improve its performance.
We showed that Cox regression yields biased results for sample sizes under 600 patients in particular settings specific to studies on predictive biomarkers, and generally overestimates the standard error of the interaction coefficient.Bias is particularly severe if few events occur in one of the four marker-treatment combinations, e.g., if the marker group with the greatest treatment benefit is small because the marker is rare, if the marker is protective and additionally decreases the event rate, or if the interaction is strong and leads to a greater treatment benefit and therefore smaller event rate.We also showed that simple modifications of the analytic method, namely a Cox model whose score function is modified with a Firth correction and CIs obtained with a PL approach, reduce bias of the interaction coefficient and marker-specific treatment effects substantially, lead to nominal coverage of CIs and increase power.Moreover, the modified Cox model converged usually in more than 95% of the datasets and much more often than the standard Cox model which converged in less than 50% of the time for some scenarios.That means that the modifications allow estimation of treatment and interaction effects in situations where the commonly used statistical model does not provide any results.When the standard Cox model converged, Firth corrected results were more or less biased than standard Cox, depending on the direction of bias.Since the direction of bias is unknown for real datasets, the results we obtained suggest that the Firth correction and PL CIs should be used instead of a standard Cox model with Wald based CI for the analysis of predictive markers.The modifications are implemented in standard statistical software packages, for example, in SAS in the PROC PHREG procedure (regression parameters and PL CI, but not PL p-value, are available) 28 and the FC06 macro (regression parameters, PL CI and PL p-value are available) 29 or in R in the coxphf function 22 .However, irrespective of the statistical methods used, results of studies with less than 400 patients need to be interpreted cautiously because they rarely have sufficient power to detect interaction.
It is important to note that bias depends mainly on the marker effect among patients treated with the standard treatment.Unbiased results are obtained when marker is harmful.However, better performance for a beneficial marker cannot be achieved through recoding of the marker by changing the reference category and estimating 1/HR M for the standard treatment.This leads to an automatic recoding of the interaction effect to 1/HR I which just shuffles the different combinations of marker and treatment and changes comparison groups.However, it does not change the number of events in the different subgroups which eventually determines bias.Bias occurs if at least some of the combinations of marker and treatment have few events, caused by a low proportion of patients and/or a low event rate due to a beneficial effect of the marker or a strong treatment effect, or both.Heavy censoring also reduces the number of events and thereby increases bias and the number of non-converged models.www.nature.com/scientificreports/Our simulation study is tightly linked to the situation of markers potentially modifying the effect of systemic BC treatment.We generated and evaluated data which closely resemble actual empirical studies but were limited to qualitative interactions.In additional analyses (data not shown), we evaluated scenarios with quantitative interactions between the marker and the treatment, i.e., with treatment benefits of different magnitude at both marker levels.They showed similar results.This makes our results credible and directly relevant for this specific area.Nevertheless, our results do generally extend to other applications of the Cox proportional hazards model, e.g., cancer at other sites, if one takes the site-specific recurrence rates into account.
The results of our study may partly explain why few predictive markers for BC treatment selection have successfully graduated from preclinical candidate markers to markers used in clinical practice: small sample size and overestimation of standard errors lead to dramatically low power, with the well-known consequences of false-negative results and an increased likelihood of significant results to be false-positive 30 .Appropriate statistical methods can help to remedy the situation somewhat.Applying a Firth corrected Cox model with PL CI instead of a standard Cox model with Wald based CI may help.However, there is still a need for the development of new or the adaptation of standard statistical methods for small studies of predictive biomarkers.Ideally, of course, predictive biomarkers should be investigated in large (enough) studies with sufficient power.To our knowledge, there is currently no software available for calculation of adequate sample size for interaction analyses between two categorical variables based on a Firth-corrected Cox model with PL CIs.However, our script for simulationbased power calculation is available on request from the first author or from our website (http:// mhb-fonta nebiost atist ics.shiny apps.io/ Power-CoxFi rth/).One can also perform the sample size calculation based on a Cox model with the Power program 31 .
TM low and TM high were binary variables indicating patients receiving experimental treatment in the two marker levels, i.e., TM low = 1 if M = 0 and T = 1 , and TM low = 0 otherwise; TM high = 1 if M = 1 and T = 1 , and TM high = 0 otherwise.Here, exp β TM low = HR TM low

Figure 1 .
Figure 1.Results of the simulation study for a protective ( HR M = 0.6 , left panel) and a harmful ( HR M = 3 , right panel) marker effect among patients treated with the standard treatment.The treatment HRs were HR TM low = 1 and HR TM high = 0.25 , the interaction HR was HR I = 0.25 , the OR between marker and treatment was OR MT = 1 , the proportion of patients with high marker level was p M = 0.25 , and the proportion of censored patients with low marker level receiving standard treatment was p c = 0.2 .* Curves of bias in marker low group for Cox and Firth model overlap.HR hazard ratio, OR odds ratio, PL profile likelihood, Rel.relative, SE standard error.

Figure 2 .
Figure 2.Results of the simulation study for 25% ( p M = 0.25 , left panel) and 75% ( p M = 0.75 , right panel) of patients with high marker level.The treatment HRs were HR TM low = 1 and HR TM high = 0.75 , the interaction HR was HR I = 0.75 , the marker effect among patients treated with the standard treatment was HR M = 0.8 , the OR between marker and treatment was OR MT = 0.5 , and the proportion of censored patients with low marker level receiving standard treatment was p c = 0.2 .HR hazard ratio, OR odds ratio, PL profile likelihood, Rel.relative, SE standard error.

Figure 3 .
Figure 3. Results of the simulation study when fewer ( OR MT = 0.5 , left panel) and more ( OR MT = 2 , right panel) patients with high marker level received experimental in comparison to standard treatment and the proportion of patients with high marker level was p M = 0.5 .The treatment HRs were HR TM low = 1 and HR TM high = 0.5 , the interaction HR was HR I = 0.5 , the marker effect among patients treated with the standard treatment was HR M = 1 , and the proportion of censored patients with low marker level receiving standard treatment was p c = 0.2 .HR hazard ratio, PL profile likelihood, Rel.relative, SE standard error. https://doi.org/10.1038/s41598-024-64573-9 https://doi.org/10.1038/s41598-024-64573-9

Table 1 .
Parameter values used for generating data.